import librosa
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import soundfile as sf
import matplotlib.pyplot as plt
import seaborn as sns
import IPython
Audio Separation Outlier Detection
Background
Outlier detection is a common use case for clustering algorithms, especially in fraud detection, spam filtering, and similar domains. Instead of a fraud-detection example, I thought audio segmentation would be another interesting application of outlier detection with clustering. This ML approach is a challenging task given how sound is represented in a computational setting, which requires some DSP (digital signal processing) background.
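To make the idea concrete before touching audio, here is a minimal sketch of distance-based outlier detection with KMeans on synthetic 2-D data (not the audio features used later): points whose distance to their cluster centroid exceeds a simple mean-plus-three-standard-deviations threshold are flagged.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: a dense "normal" cluster plus a few far-away outliers.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = np.vstack([normal, outliers])

# Fit the clustering, then measure each point's distance to its centroid.
km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points unusually far from the centroid.
threshold = dist.mean() + 3 * dist.std()
flagged = np.where(dist > threshold)[0]
print(flagged)  # the five injected points (indices 200-204)
```

The threshold here is an assumption for illustration; in practice it would be tuned to the data, and more clusters would be used when the "normal" data itself has multiple modes.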
Audio Segmentation Importance
As I was researching this problem, I found that audio outlier detection can be applied to many domains. In the realm of public safety, audio segmentation can separate ordinary sounds, like city noise and bird chirps, from outlier events like gunshots. In a full-stack system, authorities could be alerted to those outlier events autonomously.
Another cool use case for audio segmentation is music composition. We can take any song from the internet and separate the voices by instrument: for example, guitar, bass, drums, and vocals. Clustering algorithms can be applied here to separate those voices, and a full-stack system could then use them for music editing, genre classification, or even music generation.
My Application
For my application, I am applying audio segmentation to trail camera data. I found a sound file, just under a minute long, containing normal forest noises (rustling branches, wind, bird chirps) that also includes an outlier (uncommon) event: a screaming mountain lion. The assumption here is that in most geographical settings we don't hear a mountain lion scream every day (it's an outlier event), while the common (non-outlier) events are the ordinary forest sounds.
audio_file = "mountain_lion_scream.wav"
audio_data, sr = librosa.load(audio_file)
# Extract features (MFCCs as an example)
mfccs = librosa.feature.mfcc(y=audio_data, sr=sr, n_mfcc=13)
# Transpose the feature matrix to have time on the x-axis
mfccs_transposed = np.transpose(mfccs)
print(audio_data.shape)          # 1D array with 1138688 samples.
print(audio_data.shape[0] / sr)  # 51-second audio clip.
print(mfccs.shape)               # 2225 frames, each with 13 MFCCs.
(1138688,)
51.641179138321995
(13, 2225)
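With the MFCCs in hand (after the transpose, one 13-dimensional vector per frame), a natural next step is to cluster the frames and treat the rarer cluster as the outlier event. Here is a minimal sketch of that step; since the .wav file isn't bundled here, it uses synthetic stand-in features of the same shape, with a hypothetical 100-frame "scream" segment whose feature distribution is shifted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for mfccs_transposed: 2225 frames x 13 MFCCs, where a short
# stretch of frames (the hypothetical "scream") is shifted in feature space.
rng = np.random.default_rng(1)
frames = rng.normal(0.0, 1.0, size=(2225, 13))
frames[1000:1100] += 6.0  # injected outlier segment

# Two clusters: common forest sounds vs. the rare event.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(frames)

# Treat the smaller cluster as the outlier segment.
outlier_label = np.argmin(np.bincount(km.labels_))
outlier_frames = np.where(km.labels_ == outlier_label)[0]
print(len(outlier_frames), outlier_frames.min(), outlier_frames.max())
```

On real MFCCs the separation is rarely this clean, so the number of clusters and the rule for picking the outlier cluster would need tuning; standardizing the features first (e.g. with `sklearn.preprocessing.StandardScaler`) also usually helps, since MFCC dimensions have very different scales.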
IPython.display.Audio(r"./mountain_lion_scream.wav")
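Once outlier frames are identified, their indices can be mapped back to timestamps in the clip: each MFCC frame advances by `hop_length` samples (librosa's default is 512), so the time of a frame is `frame_index * hop_length / sr`. A small sketch with hypothetical flagged frame indices:

```python
import numpy as np

sr = 22050        # librosa.load's default resampling rate
hop_length = 512  # librosa.feature.mfcc's default hop length

# Hypothetical frame indices flagged as outliers by a clustering step.
outlier_frames = np.array([1000, 1050, 1099])

# Seconds into the clip; librosa.frames_to_time(outlier_frames, sr=sr,
# hop_length=hop_length) computes the same mapping.
times = outlier_frames * hop_length / sr
print(times.round(2))  # [23.22 24.38 25.52]
```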